May 10, 2021

Overview of presentation

  1. Introduction to COVID-19 World Vaccine Adverse Reactions Dataset

  2. Project workflow

  3. Project methods

    3.1 Overview of important packages and verbs used

    3.2 Challenges and solutions - Load, Clean and Augment

  4. Visualizations

  5. Modeling

  6. Conclusion and discussion

COVID-19 World Vaccine Adverse Reactions

PATIENTS.CSV: Contains information about the individuals who received the vaccines

## # A tibble: 3 x 35
##   VAERS_ID RECVDATE  STATE AGE_YRS CAGE_YR CAGE_MO SEX   RPT_DATE   SYMPTOM_TEXT
##   <chr>    <chr>     <chr>   <dbl>   <dbl>   <dbl> <chr> <date>     <chr>       
## 1 0916600  01/01/20… TX         33      33      NA F     NA         "Right side…
## 2 0916601  01/01/20… CA         73      73      NA F     NA         "Approximat…
## 3 0916602  01/01/20… WA         23      23      NA F     NA         "About 15 m…
## # … with 26 more variables: DIED <chr>, DATEDIED <chr>, L_THREAT <chr>,
## #   ER_VISIT <chr>, HOSPITAL <chr>, HOSPDAYS <dbl>, X_STAY <chr>,
## #   DISABLE <chr>, RECOVD <chr>, VAX_DATE <chr>, ONSET_DATE <chr>,
## #   NUMDAYS <dbl>, LAB_DATA <chr>, V_ADMINBY <chr>, V_FUNDBY <chr>,
## #   OTHER_MEDS <chr>, CUR_ILL <chr>, HISTORY <chr>, PRIOR_VAX <chr>,
## #   SPLTTYPE <chr>, FORM_VERS <dbl>, TODAYS_DATE <chr>, BIRTH_DEFECT <chr>,
## #   OFC_VISIT <chr>, ER_ED_VISIT <chr>, ALLERGIES <chr>

Dimensions:

dim(patients)
## [1] 34121    35

COVID-19 World Vaccine Adverse Reactions

VACCINES.CSV: Contains information about the vaccine each individual received

## # A tibble: 3 x 8
##   VAERS_ID VAX_TYPE VAX_MANU VAX_LOT VAX_DOSE_SERIES VAX_ROUTE VAX_SITE VAX_NAME
##   <chr>    <chr>    <chr>    <chr>   <chr>           <chr>     <chr>    <chr>   
## 1 0916600  COVID19  "MODERN… 037K20A 1               IM        LA       COVID19…
## 2 0916601  COVID19  "MODERN… 025L20A 1               IM        RA       COVID19…
## 3 0916602  COVID19  "PFIZER… EL1284  1               IM        LA       COVID19…

Dimensions:

dim(vaccines)
## [1] 34630     8

COVID-19 World Vaccine Adverse Reactions

SYMPTOMS.CSV: Contains information about the symptoms experienced after vaccination

## # A tibble: 3 x 11
##   VAERS_ID SYMPTOM1      SYMPTOMVERSION1 SYMPTOM2   SYMPTOMVERSION2 SYMPTOM3    
##   <chr>    <chr>                   <dbl> <chr>                <dbl> <chr>       
## 1 0916600  Dysphagia                23.1 Epiglotti…            23.1 <NA>        
## 2 0916601  Anxiety                  23.1 Dyspnoea              23.1 <NA>        
## 3 0916602  Chest discom…            23.1 Dysphagia             23.1 Pain in ext…
## # … with 5 more variables: SYMPTOMVERSION3 <dbl>, SYMPTOM4 <chr>,
## #   SYMPTOMVERSION4 <dbl>, SYMPTOM5 <chr>, SYMPTOMVERSION5 <dbl>

Dimensions:

dim(symptoms)
## [1] 48110    11

Project workflow

  1. Load data sets (patients, vaccines, symptoms)
  2. Clean each data set individually
  3. Augment and merge the data sets
  4. Make visualizations
  5. Do modeling

Project methods - Important packages and verbs

Load and clean

  • readr: read_csv(), write_csv()
  • dplyr: filter(), select(), distinct(), mutate()
  • tidyr: replace_na()

Augment

  • dplyr: filter(), select(), mutate(), case_when(), arrange(), group_by(), count(), distinct(), summarise(), rename(), inner_join(), full_join()
  • tidyr: pivot_longer(), pivot_wider(), drop_na()
  • purrr: pluck()
  • stringr: regular expressions, str_c(), str_replace()

Visualizations and modeling

  • ggplot2: geom_bar(), geom_boxplot(), geom_tile(), geom_segment(), theme_minimal()
  • forcats: fct_reorder()
  • scales
  • patchwork
  • viridis
  • stats: glm(), prcomp()
  • broom: tidy(), glance()
  • purrr: map()
  • tidyr: nest()

Project methods - Challenges and solutions - 01_load

Patients, vaccines and symptoms data sets:

  • Multiple large files → keep them compressed as .gz files and decompress only when reading into R
  • Wrong column types automatically assigned by R → manually assign appropriate column types
  • NA strings (“NA”, “N/A”, “Unknown”, " "…) → assign NAs when loading data

01_load - Challenges and Solutions 1

CHALLENGE 1: Multiple large files

SOLUTION: Keep them compressed and only decompress when reading into R:
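A minimal sketch of this idea, assuming the readr package: readr compresses on write and decompresses on read when the path ends in .gz. The temporary file and the toy rows are made up for illustration and are not the project's files.

```r
library(readr)
library(tibble)

# Toy stand-in for one of the data sets (values are illustrative only)
toy <- tibble(VAERS_ID = c("0916600", "0916601"), STATE = c("TX", "CA"))

# write_csv() gzips the output because the path ends in .gz
gz_path <- tempfile(fileext = ".csv.gz")
write_csv(toy, gz_path)

# read_csv() decompresses on the fly, so no unpacked copy is needed on disk
toy_back <- read_csv(gz_path, col_types = cols(.default = col_character()))
```

Reading every column as character here only keeps the sketch short; the real load step assigns a type per column.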

01_load - Challenges and Solutions 2

CHALLENGE: Wrong column types automatically assigned by R

## Warning: 241 parsing failures.
##  row          col           expected     actual         file
## 1465 BIRTH_DEFECT 1/0/T/F/TRUE/FALSE Y          <connection>
## 2742 X_STAY       1/0/T/F/TRUE/FALSE Y          <connection>
## 2807 RPT_DATE     1/0/T/F/TRUE/FALSE 2021-01-04 <connection>
## 2807 V_FUNDBY     1/0/T/F/TRUE/FALSE OTH        <connection>
## 2811 RPT_DATE     1/0/T/F/TRUE/FALSE 2021-01-04 <connection>
## .... ............ .................. .......... ............
## See problems(...) for more details.

SOLUTION: Manually assign column types
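The warning above is typical of type guessing: readr inspects only the first rows (1000 by default), so a column that is empty early on is likely guessed as logical and later values such as Y or 2021-01-04 fail to parse. A sketch of a manual assignment follows, using three of the 35 patient columns; the toy file contents are assumptions made for the example.

```r
library(readr)

# Toy file with three of the patient columns (values are illustrative)
csv_path <- tempfile(fileext = ".csv")
writeLines(c("VAERS_ID,RPT_DATE,BIRTH_DEFECT",
             "0916600,,",
             "0916601,2021-01-04,Y"), csv_path)

patients_toy <- read_csv(
  csv_path,
  col_types = cols(
    VAERS_ID     = col_character(),                # keeps the leading zero
    RPT_DATE     = col_date(format = "%Y-%m-%d"),  # ISO dates, as in the warning
    BIRTH_DEFECT = col_character()                 # "Y" flag, not a logical
  )
)
```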

01_load - Challenges and Solutions 3

CHALLENGE: NA strings (“NA”, “N/A”, “Unknown”, " "…)

SOLUTION: Assign NAs when loading the data
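One way to do this with readr is the na argument of read_csv(). The sketch below uses only the strings named on the slide; the full list is elided there, so this vector is an assumption and is probably incomplete. Matching is exact, so other spellings would need their own entries.

```r
library(readr)

csv_path <- tempfile(fileext = ".csv")
writeLines(c("VAERS_ID,ALLERGIES,HISTORY",
             "0916600,N/A,Unknown",
             "0916601,Penicillin,NA"), csv_path)

# Assumed subset of the NA strings; extend as more variants turn up
na_strings <- c("", " ", "NA", "N/A", "Unknown")

toy <- read_csv(csv_path, na = na_strings,
                col_types = cols(.default = col_character()))
```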

Project methods - Challenges and solutions - 02_clean

Patients data set:

  • Unwanted dirty/uninformative columns → select(-c(CAGE_YR, CAGE_MO, RPT_DATE, SYMPTOM_TEXT, LAB_DATA, OFC_VISIT, ER_VISIT, X_STAY, V_FUNDBY, BIRTH_DEFECT, SPLTTYPE, RECVDATE, RECOVD, L_THREAT))

  • NAs that should be interpreted as “no” → replace_na()

    • Examples: ALLERGIES and OTHER_MEDS columns
  • Row duplications → distinct()
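The last two steps could look roughly like this. The replacement value "none" is an assumption; the slides only say that NA should be read as "no".

```r
library(dplyr)
library(tidyr)

patients_toy <- tibble(
  VAERS_ID   = c("0916600", "0916600", "0916601"),
  ALLERGIES  = c(NA, NA, "Penicillin"),
  OTHER_MEDS = c("Ibuprofen", "Ibuprofen", NA)
)

patients_clean <- patients_toy %>%
  # "none" is a hypothetical placeholder for "no allergies / no medication"
  replace_na(list(ALLERGIES = "none", OTHER_MEDS = "none")) %>%
  # drop rows that are exact copies of an earlier row
  distinct()
```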

Project methods - Challenges and solutions - 02_clean

Vaccine data set:

  • Contains non-COVID19 vaccines → filter(VAX_TYPE == "COVID19")
  • Row duplications → distinct()
  • Duplicated IDs → add_count(VAERS_ID) %>% filter(n == 1) %>% select(-n)
  • Contains vaccines of unknown manufacturer → filter(VAX_MANU != "UNKNOWN MANUFACTURER")
  • Inconsistent naming of vaccines → recode()
  • Redundant and dirty columns → select(-c(VAX_NAME, VAX_LOT))
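Chained together, the vaccine cleaning steps above might be sketched as follows. The toy rows, the FLU3 and SANOFI labels, and the raw spelling "PFIZER/BIONTECH" are invented for the example; only the verbs and the two filter values come from the slide.

```r
library(dplyr)

vaccines_toy <- tibble(
  VAERS_ID = c("1", "1", "2", "3", "3", "4", "5"),
  VAX_TYPE = c("COVID19", "COVID19", "FLU3", "COVID19",
               "COVID19", "COVID19", "COVID19"),
  VAX_MANU = c("MODERNA", "MODERNA", "SANOFI", "MODERNA",
               "PFIZER/BIONTECH", "UNKNOWN MANUFACTURER", "PFIZER/BIONTECH")
)

vaccines_clean <- vaccines_toy %>%
  filter(VAX_TYPE == "COVID19") %>%               # drop non-COVID19 vaccines
  distinct() %>%                                  # drop exact duplicate rows
  add_count(VAERS_ID) %>%                         # rows per ID after de-duplication
  filter(n == 1) %>%                              # keep IDs with a single vaccine
  select(-n) %>%
  filter(VAX_MANU != "UNKNOWN MANUFACTURER") %>%
  # hypothetical raw spelling mapped to one consistent name
  mutate(VAX_MANU = recode(VAX_MANU, "PFIZER/BIONTECH" = "PFIZER-BIONTECH"))
```

Running distinct() before add_count() matters: ID 1 appears twice as an exact duplicate and survives, while ID 3 has two different manufacturers and is dropped.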

Symptoms data set:

  • SYMPTOMVERSION1-5 columns are unnecessary → select(-c(SYMPTOMVERSION1, SYMPTOMVERSION2, SYMPTOMVERSION3, SYMPTOMVERSION4, SYMPTOMVERSION5))

02_clean - Challenges and Solutions 1

Challenge → solution:

  • Unwanted columns → select(-c())
  • NAs that should be interpreted as “no” → replace_na()
  • Row duplications → distinct()
  • Individuals who got more than one vaccine type (generates noise) → add_count(VAERS_ID) %>% filter(n == 1) %>% select(-n)

Project methods - Challenges and solutions - 03_augment

Patients data set:

  • Columns containing long string descriptions → Make tidy categorical variables
    • ALLERGIES → HAS_ALLERGIES (Y/N)
    • CUR_ILL → HAS_ILLNESS (Y/N)
    • CUR_ILL → HAS_COVID (Y/N)
    • HISTORY → HAD_COVID (Y/N)
    • PRIOR_VAX → PRIOR_ADVERSE (Y/N)
    • OTHER_MEDS →
      • TAKES_ANTIINFLAMATORY (Y/N)
      • TAKES_STEROIDS (Y/N)
    • AGE_YRS → AGE_CLASS
    • DATEDIED, VAX_DATE → DIED_AFTER
  • Uninformative variable names → mutate(), case_when()
  • Rows with negative values in DIED_AFTER → filter()
  • Dirty, redundant and uninformative columns → select(-c(ALLERGIES, CUR_ILL, HISTORY, PRIOR_VAX, OTHER_MEDS, VAX_DATE, DATEDIED, ONSET_DATE, TODAYS_DATE))
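As an illustration of two of the derived columns, the sketch below computes DIED_AFTER as days between vaccination and death and bins age into AGE_CLASS. The MM/DD/YYYY date format, the age breaks and labels, and the choice to keep rows without a death date are all assumptions, not taken from the slides.

```r
library(dplyr)

patients_toy <- tibble(
  VAERS_ID = c("1", "2", "3"),
  AGE_YRS  = c(33, 73, 23),
  VAX_DATE = c("01/04/2021", "01/10/2021", "01/05/2021"),
  DATEDIED = c(NA, "01/15/2021", "01/02/2021")
)

patients_aug <- patients_toy %>%
  mutate(
    # days from vaccination to death; NA when no death date is recorded
    DIED_AFTER = as.numeric(as.Date(DATEDIED, format = "%m/%d/%Y") -
                              as.Date(VAX_DATE, format = "%m/%d/%Y")),
    # hypothetical age bins: (0,30], (30,60], (60,Inf]
    AGE_CLASS = cut(AGE_YRS, breaks = c(0, 30, 60, Inf),
                    labels = c("0-30", "31-60", "61+"))
  ) %>%
  # a death recorded before vaccination is treated as a data error
  filter(is.na(DIED_AFTER) | DIED_AFTER >= 0)
```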

Project methods - Challenges and solutions - 03_augment

Vaccine data set:

  • No augmentation!

Symptoms data set:

  • There are too many symptoms to analyze → make top_n_symptoms() function and use it to extract the top 20 occurring symptoms
  • Symptoms are recorded in a way that makes later analysis difficult → turn top 20 symptoms into TRUE/FALSE columns
  • Total number of symptoms needed for later analysis → mutate() to add column (N_SYMPTOMS) with total number of symptoms for each subject
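The slides do not show top_n_symptoms() itself, so the following is a hypothetical re-implementation of the idea: stack the SYMPTOM1-5 columns, count each symptom, and keep the most frequent ones. The argument name n_symp and the toy data are assumptions.

```r
library(dplyr)
library(tidyr)

# Hypothetical version; the project's own top_n_symptoms() may differ
top_n_symptoms <- function(data, n_symp = 20) {
  data %>%
    pivot_longer(cols = -VAERS_ID, names_to = "slot", values_to = "SYMPTOM",
                 values_drop_na = TRUE) %>%
    count(SYMPTOM, sort = TRUE) %>%   # most frequent first
    head(n_symp) %>%
    pull(SYMPTOM)
}

symptoms_toy <- tibble(
  VAERS_ID = c("1", "2", "3"),
  SYMPTOM1 = c("Headache", "Headache", "Chills"),
  SYMPTOM2 = c("Chills", "Pyrexia", "Headache")
)

top2 <- top_n_symptoms(symptoms_toy, n_symp = 2)
```

Ties at the cut-off are broken by row order here; a real implementation may want an explicit rule.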

Project methods - Challenges and solutions - 03_augment

Merged data sets:

  • Clean and augmented patients, vaccines and symptoms data sets → inner_join() for one wide format tibble
  • For certain types of analysis, we need the symptoms in a long-format → pivot_longer() to create:
    • SYMPTOM column: top 20 symptom names
    • SYMPTOM_VALUE column: TRUE/FALSE
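A small sketch of the merge and the long format, with two symptom columns standing in for the top 20. The toy tibbles are invented; the column names SYMPTOM and SYMPTOM_VALUE follow the slide.

```r
library(dplyr)
library(tidyr)

patients_toy <- tibble(VAERS_ID = c("1", "2", "3"), SEX = c("F", "M", "F"))
vaccines_toy <- tibble(VAERS_ID = c("1", "2"), VAX_MANU = c("MODERNA", "JANSSEN"))
symptoms_toy <- tibble(VAERS_ID = c("1", "2"),
                       HEADACHE = c(TRUE, FALSE), CHILLS = c(FALSE, TRUE))

# inner_join() keeps only IDs present in all three tables (ID 3 drops out)
merged_wide <- patients_toy %>%
  inner_join(vaccines_toy, by = "VAERS_ID") %>%
  inner_join(symptoms_toy, by = "VAERS_ID")

# one row per subject and symptom
merged_long <- merged_wide %>%
  pivot_longer(cols = c(HEADACHE, CHILLS),
               names_to = "SYMPTOM", values_to = "SYMPTOM_VALUE")
```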

03_augment - Challenges and Solutions 1

CHALLENGE: Some columns contain long string descriptions that need to be turned into something tidy

SOLUTION: Make categorical variable

03_augment - Challenges and Solutions 1

Example: ALLERGIES column:

Make categorical variable that states if patient has allergies or not:

Clean categorical HAS_ALLERGIES column:

## # A tibble: 5 x 3
##   VAERS_ID ALLERGIES                                               HAS_ALLERGIES
##   <chr>    <chr>                                                   <chr>        
## 1 0916603  Diclofenac, novacaine, lidocaine, pickles, tomatoes, m… Y            
## 2 0916604  <NA>                                                    N            
## 3 0916660  Penicillin                                              Y            
## 4 0916685  none that I am aware of                                 N            
## 5 0917437  No known allergies                                      N
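A column like HAS_ALLERGIES can be derived with case_when() and a regular expression. The sketch reuses four of the rows shown above; the pattern is a guess at phrases meaning "no allergies", and the real list is likely longer.

```r
library(dplyr)
library(stringr)

patients_toy <- tibble(
  VAERS_ID  = c("0916604", "0916660", "0916685", "0917437"),
  ALLERGIES = c(NA, "Penicillin", "none that I am aware of", "No known allergies")
)

# Assumed pattern for free-text answers that mean "no allergies"
no_allergy <- regex("^(none|no known|no|n/a|nka|nkda)\\b", ignore_case = TRUE)

patients_aug <- patients_toy %>%
  mutate(HAS_ALLERGIES = case_when(
    is.na(ALLERGIES)                  ~ "N",  # missing is read as "no"
    str_detect(ALLERGIES, no_allergy) ~ "N",
    TRUE                              ~ "Y"
  ))
```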

03_augment - Challenges and Solutions 1

Another example: OTHER_MEDS column

Detect individuals who took anti-inflammatory or steroid drugs before vaccination (not recommended):

Clean, categorical TAKES_ANTIINFLAMMATORY and TAKES_STEROIDS columns:

## # A tibble: 4 x 4
##   VAERS_ID OTHER_MEDS                           TAKES_ANTIINFLAM… TAKES_STEROIDS
##   <chr>    <chr>                                <chr>             <chr>         
## 1 0918421  1 aspirin a day 81 mg, levothyroxin… Y                 N             
## 2 0921732  Ibuprofen - PRN  States she does no… Y                 N             
## 3 0932980  Hydrocortisone 25mg daily.  Fludroc… N                 Y             
## 4 0934539  Singulair, Oxybutynin, Fosamax, Pre… N                 Y
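The same pattern-matching idea applies to OTHER_MEDS. In this sketch the medication strings are shortened versions of the rows above, and the drug names in the two patterns are illustrative, not the project's actual search terms.

```r
library(dplyr)
library(stringr)

patients_toy <- tibble(
  VAERS_ID   = c("0918421", "0921732", "0932980"),
  OTHER_MEDS = c("1 aspirin a day 81 mg", "Ibuprofen - PRN",
                 "Hydrocortisone 25mg daily.")
)

# Assumed, incomplete drug lists
antiinflammatory <- regex("aspirin|ibuprofen|naproxen", ignore_case = TRUE)
steroids         <- regex("hydrocortisone|prednisone|dexamethasone",
                          ignore_case = TRUE)

patients_aug <- patients_toy %>%
  mutate(
    TAKES_ANTIINFLAMMATORY = if_else(str_detect(OTHER_MEDS, antiinflammatory), "Y", "N"),
    TAKES_STEROIDS         = if_else(str_detect(OTHER_MEDS, steroids), "Y", "N")
  )
```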

03_augment - Challenges and Solutions 2

CHALLENGE: Symptoms are recorded in a way that makes later analysis difficult

## # A tibble: 5 x 6
##   VAERS_ID SYMPTOM1           SYMPTOM2        SYMPTOM3 SYMPTOM4         SYMPTOM5
##   <chr>    <chr>              <chr>           <chr>    <chr>            <chr>   
## 1 0916618  Injection site pa… Pain            <NA>     <NA>             <NA>    
## 2 0916619  Injection site pa… Menorrhagia     <NA>     <NA>             <NA>    
## 3 0916620  Arthralgia         Chills          Headache Mobility decrea… Myalgia 
## 4 0916620  Nausea             Pain in extrem… Pyrexia  <NA>             <NA>    
## 5 0916621  Chills             Fatigue         Headache Myalgia          <NA>

SOLUTION: 20 most common symptoms are found and turned into TRUE/FALSE columns

## # A tibble: 3 x 21
##   VAERS_ID HEADACHE PYREXIA CHILLS FATIGUE PAIN  PAIN_IN_EXTREMITY NAUSEA
##   <chr>    <lgl>    <lgl>   <lgl>  <lgl>   <lgl> <lgl>             <lgl> 
## 1 0916600  FALSE    FALSE   FALSE  FALSE   FALSE FALSE             FALSE 
## 2 0916601  FALSE    FALSE   FALSE  FALSE   FALSE FALSE             FALSE 
## 3 0916602  FALSE    FALSE   FALSE  FALSE   FALSE TRUE              FALSE 
## # … with 13 more variables: DIZZINESS <lgl>, MYALGIA <lgl>,
## #   INJECTION_SITE_ERYTHEMA <lgl>, INJECTION_SITE_PRURITUS <lgl>,
## #   INJECTION_SITE_SWELLING <lgl>, INJECTION_SITE_PAIN <lgl>, ARTHRALGIA <lgl>,
## #   DYSPNOEA <lgl>, VOMITING <lgl>, PRURITUS <lgl>, DEATH <lgl>, RASH <lgl>,
## #   ASTHENIA <lgl>
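One possible route from the SYMPTOM1-5 layout to logical columns is to pivot to long format, keep the chosen symptoms, and pivot back to wide with FALSE as the fill value. Here top_symptoms is a two-element stand-in for the top 20, and the toy rows (including an ID that spans two rows, as 0916620 does above) are invented.

```r
library(dplyr)
library(tidyr)

symptoms_toy <- tibble(
  VAERS_ID = c("1", "2", "2"),
  SYMPTOM1 = c("Headache", "Chills", "Nausea"),
  SYMPTOM2 = c("Chills", NA, NA)
)
top_symptoms <- c("Headache", "Chills")  # stand-in for the top 20

symptoms_wide <- symptoms_toy %>%
  pivot_longer(-VAERS_ID, names_to = "slot", values_to = "SYMPTOM",
               values_drop_na = TRUE) %>%
  distinct(VAERS_ID, SYMPTOM) %>%
  group_by(VAERS_ID) %>%
  mutate(N_SYMPTOMS = n()) %>%           # all symptoms, not only the top ones
  ungroup() %>%
  filter(SYMPTOM %in% top_symptoms) %>%
  mutate(SYMPTOM = toupper(gsub(" ", "_", SYMPTOM)), present = TRUE) %>%
  pivot_wider(names_from = SYMPTOM, values_from = present, values_fill = FALSE)
```

A subject with none of the top symptoms drops out of this sketch; joining back to the full ID list would restore such rows as all FALSE.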

04_analysis_visualizations

04_analysis_visualizations - Age, sex and vaccine manufacturer distribution

04_analysis_visualizations - Age distribution

04_analysis_visualizations - Age manufacturer distribution

04_analysis_visualizations - Sex and vaccine manufacturer distribution

Sex distribution

SEX       n
F     24070
M      8514
NA      828

Vaccine manufacturer distribution

VAX_MANU             n
JANSSEN           1106
MODERNA          16253
PFIZER-BIONTECH  16053

04_analysis_visualizations - Days until onset of symptoms vs. Age Group

04_analysis_visualizations - Age/sex vs. number of symptoms

04_analysis_visualizations - Vaccine manufacturer vs. number of symptoms

04_analysis_visualizations - Age vs. types of symptoms

04_analysis_visualizations - Sex vs. types of symptoms

04_analysis_visualizations - Vaccine manufacturer vs. types of symptoms

04_analysis_visualizations - Vaccine manufacturer vs. death

04_analysis_regressions

04_analysis_modeling

Logistic regression: death ~ patient profile

## # A tibble: 7 x 6
##   term           estimate std.error statistic  p.value odds_ratio
##   <chr>             <dbl>     <dbl>     <dbl>    <dbl>      <dbl>
## 1 (Intercept)    -9.34      0.161    -58.0    0         0.0000876
## 2 SEXM            0.924     0.0573    16.1    2.18e-58  2.52     
## 3 AGE_YRS         0.0915    0.00207   44.2    0         1.10     
## 4 HAS_ALLERGIESY -0.100     0.0608    -1.65   9.82e- 2  0.904    
## 5 HAS_ILLNESSY    1.10      0.0664    16.6    6.60e-62  3.01     
## 6 HAS_COVIDY     -0.117     0.148     -0.791  4.29e- 1  0.890    
## 7 HAD_COVIDY      0.00915   0.193      0.0474 9.62e- 1  1.01
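The odds_ratio column is the exponentiated estimate: for SEXM, exp(0.924) ≈ 2.52. A table of this shape can be produced with glm() and broom::tidy(). The sketch below fits simulated data, so its numbers are unrelated to the results above; the simulation coefficients and the two-predictor formula are assumptions.

```r
library(dplyr)
library(broom)

set.seed(1)
toy <- tibble(
  SEX     = sample(c("F", "M"), 500, replace = TRUE),
  AGE_YRS = round(runif(500, 18, 95))
) %>%
  # simulated outcome; coefficients chosen arbitrarily for the example
  mutate(DIED = rbinom(n(), 1, plogis(-6 + 0.07 * AGE_YRS + 0.8 * (SEX == "M"))))

fit <- glm(DIED ~ SEX + AGE_YRS, data = toy, family = binomial(link = "logit"))

model_tbl <- tidy(fit) %>%
  mutate(odds_ratio = exp(estimate))  # log-odds to odds ratio
```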

04_analysis_modeling

Logistic regression: death ~ symptoms

## # A tibble: 20 x 6
##   term          estimate std.error statistic  p.value odds_ratio
##   <chr>            <dbl>     <dbl>     <dbl>    <dbl>      <dbl>
## 1 (Intercept)     -2.01     0.0287    -70.1  0             0.134
## 2 HEADACHETRUE    -1.67     0.156     -10.7  7.92e-27      0.188
## 3 PYREXIATRUE     -0.429    0.112      -3.82 1.34e- 4      0.651
## 4 CHILLSTRUE      -1.21     0.171      -7.11 1.17e-12      0.298
## 5 FATIGUETRUE     -0.367    0.115      -3.19 1.41e- 3      0.693
## 6 PAINTRUE        -0.913    0.153      -5.98 2.17e- 9      0.401
## 7 NAUSEATRUE      -0.621    0.139      -4.46 8.17e- 6      0.538
## 8 DIZZINESSTRUE   -2.17     0.193     -11.2  2.87e-29      0.114
## # … with 12 more rows

04_analysis_modeling

Many logistic regressions: each symptom ~ takes anti-inflammatory

## # A tibble: 20 x 9
##   SYMPTOM  estimate std.error statistic p.value conf.low conf.high odds_ratio
##   <chr>       <dbl>     <dbl>     <dbl>   <dbl>    <dbl>     <dbl>      <dbl>
## 1 HEADACHE  -0.170     0.0954    -1.79   0.0742   -0.361    0.0133      0.843
## 2 PYREXIA    0.0734    0.0967     0.760  0.448    -0.120    0.259       1.08 
## 3 CHILLS    -0.0727    0.103     -0.703  0.482    -0.280    0.126       0.930
## 4 FATIGUE    0.0226    0.102      0.221  0.825    -0.183    0.219       1.02 
## 5 PAIN       0.0190    0.106      0.179  0.858    -0.194    0.222       1.02 
## # … with 15 more rows, and 1 more variable: identified_as <chr>
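Fitting one model per symptom fits the nest() and map() pattern listed among the project's verbs. This is a sketch on simulated long-format data with two symptoms; it is not the project's code, and the simulated values carry no meaning.

```r
library(dplyr)
library(tidyr)
library(purrr)
library(broom)

set.seed(1)
n <- 300
toy_long <- tibble(
  TAKES_ANTIINFLAMMATORY = rep(sample(c("N", "Y"), n, replace = TRUE), times = 2),
  SYMPTOM       = rep(c("HEADACHE", "PYREXIA"), each = n),
  SYMPTOM_VALUE = sample(c(TRUE, FALSE), 2 * n, replace = TRUE)
)

models <- toy_long %>%
  group_by(SYMPTOM) %>%
  nest() %>%                       # one row per symptom, data in a list column
  ungroup() %>%
  mutate(
    fit    = map(data, ~ glm(SYMPTOM_VALUE ~ TAKES_ANTIINFLAMMATORY,
                             data = .x, family = binomial(link = "logit"))),
    tidied = map(fit, tidy, conf.int = TRUE)
  ) %>%
  unnest(tidied) %>%
  filter(term != "(Intercept)") %>%  # keep only the drug effect
  mutate(odds_ratio = exp(estimate)) %>%
  select(SYMPTOM, estimate, std.error, statistic, p.value,
         conf.low, conf.high, odds_ratio)
```

With 20 symptoms tested at once, some adjustment for multiple comparisons would normally be considered before reading the p-values.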

04_analysis_tests

Chi-squared contingency table tests
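The slides do not say which variables were tested, so the following is only a generic sketch of a chi-squared test on a 2 x 2 contingency table with base R. The SEX by HEADACHE pairing and the counts are invented.

```r
# Toy 2 x 2 contingency table; counts are made up for illustration
tab <- matrix(c(30, 70, 45, 55), nrow = 2, byrow = TRUE,
              dimnames = list(SEX = c("F", "M"),
                              HEADACHE = c("TRUE", "FALSE")))

# Null hypothesis: the row and column variables are independent
res <- chisq.test(tab)
res$p.value
```

On real data the table would come from something like table(df$SEX, df$HEADACHE), where df is the merged wide tibble.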

04_analysis_clustering

04_analysis_clustering - Important tools used

Important verbs and tools used:

  • prcomp()
  • kmeans()
  • tidymodels
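A base-R sketch of how prcomp() and kmeans() can be combined: run PCA on the symptom indicators, read the variance explained per component (the quantity a scree plot shows), and cluster on the first two components. The simulated matrix, the five symptom columns, and the choice of three clusters are assumptions.

```r
set.seed(1)

# Toy logical symptom matrix standing in for the top-20 TRUE/FALSE columns
X <- matrix(sample(c(TRUE, FALSE), 200 * 5, replace = TRUE), ncol = 5,
            dimnames = list(NULL, c("HEADACHE", "PYREXIA", "CHILLS",
                                    "FATIGUE", "PAIN")))
X_num <- X * 1  # convert TRUE/FALSE to 1/0 for prcomp()

pca <- prcomp(X_num, center = TRUE, scale. = TRUE)

# Share of variance per component: the bar heights of a scree plot
var_explained <- pca$sdev^2 / sum(pca$sdev^2)

# pca$rotation holds the loadings; pca$x holds the projected observations
clusters <- kmeans(pca$x[, 1:2], centers = 3, nstart = 10)
```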

04_analysis_clustering - PCA biplot

04_analysis_clustering - Rotation matrix

04_analysis_clustering - Scree plot

Conclusion and discussion